LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs
Title CN
LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in Multi-modal Large Language Models
Keywords
Progressive Visual Compression · High-Resolution Processing · Visual Token Compression · Multi-modal Large Language Models · ViT Architecture Optimization
Summary
This paper presents LLaVA-UHD v3, which achieves efficient native-resolution visual encoding through a newly designed Progressive Visual Compression (PVC) method. PVC comprises two modules: refined patch embedding with flexible patch-size scaling, and windowed token compression deployed hierarchically across ViT layers; together they substantially improve inference efficiency while preserving the generality of the pretrained ViT. Experiments show that ViT-UHD matches MoonViT in performance while reducing time-to-first-token (TTFT) by 2.4x, and that LLaVA-UHD v3 performs on par with Qwen2-VL while further reducing TTFT by 1.9x.
Reason
The paper proposes Progressive Visual Compression (PVC), which combines refined patch embedding with hierarchical windowed token compression to substantially reduce computational cost while retaining native high-resolution input. This directly targets the core pain points of document image understanding, namely high-resolution processing and fine-grained localization, and the ViT-UHD architecture can be integrated seamlessly into existing VLMs, improving both the efficiency and accuracy of OCR-free document understanding. It represents a significant advance along the "visual compression" line advocated by DeepSeek-OCR, with paradigm-shifting implications.
Abstract
Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native-resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual modeling, and (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates performance competitive with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves performance comparable to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.
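To make the windowed-token-compression idea concrete, below is a minimal, dependency-free sketch. All names, the window size of 2, and the use of simple averaging are illustrative assumptions, not the paper's actual implementation (which deploys learned compression hierarchically across ViT layers): each non-overlapping 2x2 window of the token grid is pooled into one token, so each stage cuts the token count by 4x.

```python
# Illustrative sketch only: windowed token compression via average pooling.
# The paper's method is learned and layer-wise; this shows the token-count
# mechanics (one stage = 4x fewer tokens for a 2x2 window).

def window_compress(tokens, h, w, window=2):
    """tokens: row-major list of h*w feature vectors; returns
    (compressed tokens, new height, new width)."""
    assert h % window == 0 and w % window == 0
    d = len(tokens[0])
    out = []
    for bi in range(0, h, window):
        for bj in range(0, w, window):
            # Average every vector inside the window into one token.
            acc = [0.0] * d
            for i in range(bi, bi + window):
                for j in range(bj, bj + window):
                    for k in range(d):
                        acc[k] += tokens[i * w + j][k]
            out.append([v / (window * window) for v in acc])
    return out, h // window, w // window

# A 4x4 grid of 1-d tokens compresses to a 2x2 grid (16 -> 4 tokens).
grid = [[float(i)] for i in range(16)]
small, nh, nw = window_compress(grid, 4, 4)
print(len(small), nh, nw)  # 4 2 2
```

Stacking such a stage after successive blocks of the encoder is what makes the compression "progressive": token count shrinks geometrically with depth, which is where the TTFT savings come from.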
Authors
Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, Maosong Sun
Categories
Artificial Intelligence, Computer Vision and Pattern Recognition